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ABSTRACT 

He  analyze  a  Bayesian  model  for  response  surfaces.  This  model  augments  a 
simple  graduating  function  with  a  bias  term  that  represents  the  difference 
between  the  true ,  but  unknown,  response  function  and  the  graduating 
function.  The  model  is  a  straightforward  extension  of  Blight  and  Ott's  (1975) 
Bayesian  model  for  polynomial  regression.  The  bias  term  is  defined  in  terms 
of  prior  distributions  and  we  show  how  estimates  of  the  response  surface  and 
measures  of  precision  depend  on  the  form  of  the  prior  distribution.  Estimates 
and  measures  of  precision  are  given  for  strictly  proper  prior  distributions 
and  also  when  improper  priors  are  assigned  to  the  coefficients  of  the 
graduating  function,  in  which  case  the  model  leads  to  generalized  spline 
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SIGNIFICANCE  AND  EXPIANATION 


Scientists  often  wish  to  describe  the  relationship  between  a  response 
variable  and  a  collection  of  explanatory  variables.  When  the  particular 
nature  of  the  relationship  is  unknown,  as  is  often  the  case,  a  common  strategy 
is  to  develop  an  empirical  model  by  using  a  simple  graduating  function  such  as 
a  low-degree  polynomial  to  approximate  the  true  relationship.  The  techniques 
of  response  surface  methodology  were  developed  to  accomplish  this  goal. 

We  consider  a  generalization  of  standard  response  surface  methodolgy  that 
attempts  to  take  into  account  the  approximate  nature  of  the  graduating 
functions  that  are  used.  We  propose  a  model  in  which  the  graduating  function 
is  augmented  by  a  bias  term  that  represents  the  difference  between  the  true, 
but  unknown,  response  function  and  the  graduating  function.  The  bias  term  is 
characterized  in  terms  of  the  scientist's  prior  beliefs  about  its  likely 
magnitude  and  its  likely  similarity  for  similar  combinations  of  the 
explanatory  variables.  The  use  of  prior  information  is  central  to  the 
Bayesian  approach  to  statistical  inference. 

We  derive  estimates  of  the  response  surface  and  measures  of  precision  for 
the  estimates.  It  is  shown  how  both  of  these  depend  critically  on  the  bias 
term.  Two  examples  illustrate  the  technique  and  afford  a  comparison  between 
the  conclusions  of  the  Bayesian  model  and  those  of  standard  response  surface 
models . 


The  responsibility  for  the  wording  and  views  expressed  in  this  descriptive 
summary  lies  with  MRC,  and  not  with  the  author  of  this  report. 


BAYESIAN  MODELS  FOR  RESPONSE  SURFACES  II: 

ESTIMATING  THE  RESPONSE  SURFACE* 

David  M.  Steinberg 
1 .  INTRODUCTION 

He  consider  the  common  situation  in  which  an  empirical  model  is  sought  to 
describe  the  relationship  between  a  response  variable  Y  and  a  collection 
X1'****'Xk  of  explanatory  variables.  He  assume  that  the  true  dependence  of 
Y  on  *  *  (Xj( • • • • ,  X^)  is  given  by  an  unknown  response  function  g(x)  and 
that  our  goal  is  to  obtain  a  reasonable  approximation  to  g  on  the  basis  of 

i  i  n 

experimental  data  iYi,xiii«i‘  A  standard  approach  to  the  empirical  modeling 
problem  described  above  is  that  of  response  surface  methodology  (see  Box  and 
Hilson  1951,  Box  1954,  Box  and  Youle  1955,  and  Myers  1976),  in  which  a  simple 
graduating  function,  such  as  a  polynomial  of  low  degree,  is  used  to 
approximate  g.  He  will  consider  a  generalization  of  classical  response 
surface  models  that  Includes  not  only  a  simple  graduating  function,  but  also  a 
characterization  of  the  extent  to  which  the  graduating  function  is  believed  to 
accurately  represent  the  true  response  function.  The  model  is  Bayesian  in 
that  the  characterization  is  expressed  in  terms  of  prior  distributions. 

The  model  we  will  analyze  is  a  straightforward  extension  of  a  Bayesian 
model  proposed  by  Blight  and  Ott  ( 1975)  for  the  special  case  of  polynomial 
regression  with  a  single  explanatory  variable.  Smith  (1973),  Young  (1977), 
and  O'Hagan  (1978)  also  proposed  Bayesian  models  for  estimating  an  unknown 
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response  function.  Steinberg  (1984)  showed  that  these  models  are 
equivalent  to  the  extended  Blight-Ott  model  and  that ,  for  certain 
prior  specifications ,  the  models  are  equivalent  to  those  described 
by  Wahba  (1978),  which  yield  generalized  smoothing  splines  as 
estimates  of  g.  Some  of  the  results  in  this  paper  are  included  in 
each  of  the  preceding  papers.  By  combining  their  separate  results 
and  some  new  results,  we  are  able  to  present  a  more  complete 
description  of  Bayesian  response  surface  estimates. 

Section  2  describes  the  model  and  Section  3  presents  the  Bayes 
estimator  of  the  response  function  g  when  all  the  prior  distribu¬ 
tions  in  the  model  are  proper.  Section  4  studies  the  implications 
of  assigning  improper  priors  to  some  of  the  parameters  in  the 
model.  Section  5  presents  some  additional  results  regarding  the 
vector  of  estimated  values  at  the  design  points  and  Section  6  gives 
results  on  the  precision  of  the  estimates.  Section  7  describes  some 
methods  that  might  be  used  to  estimate  additional  parameters  that 
appear  in  the  model.  Two  examples  are  presented  in  Section  8  and 
discussion  and  concluding  comments  are  offered  in  Section  9. 

2.  THE  BAYESIAN  RESPONSE  SURFACE  MODEL 

Classical  reaponae  surface  models  for  a  response  variable  Y 
and  explanatory  variables  x  -  (X1,....,X)c)  can  be  written  as 
multiple  regression  models: 

Yx  -  f’(x)0  +  e,  (2.1) 

where  f(x)  is  a  column  vector  of  functions  of  the  explanatory 
variables,  0  is  a  column  vector  of  unknown  parameters  that  must  be 
estimated,  e  denotes  experimental  error,  and  primes  denote  vector 


(or  matrix)  transposes.  Me  include  the  subscript  x  on  Y  in 

(2.1)  to  emphasize  that  this  is  the  model  for  the  response  variable 
at  the  particular  collection  x  of  explanatory  variables.  He  also 
write  (2.1)  as  an  approximate  equality  (*)  rather  than  an  exact 
equality  to  emphasize  that  f'(x)0  is  assumed  to  be  a  reasonable 
local  approximation  to  the  true  response  function  g(x),  but  is  not 
an  exact  representation.  Throughout  the  paper,  we  will  indicate 
vectors  and  matrices  with  boldface  type. 

Blight  and  Ott  (1975)  suggested  that  (2.1)  could  be  improved, 
in  the  case  of  polynomial  regression,  by  including  an  additional 
term  to  account  for  the  (in) adequacy  of  the  graduating  function 
f' (x)0  to  accurately  represent  g(x).  To  extend  their  approach  to 
the  general  empirical  modeling  situation  described  in  Section  1  is 
perfectly  straightforward  and  we  do  so  here.  We  replace  (2.1)  by: 

Yx  =  f' +  nx  +  e'  ( 2.2 ) 

where  nx  =  g(x)  -  f'(x)0  is  the  bias  at  x  associated  with  the 
particular  graduating  function  that  has  been  chosen.  We  write 

(2.2) ,  unlike  (2.1),  with  an  equals  (*)  sign  to  emphasize  that 
inclusion  of  the  additional  term  nx  is  assumed  to  make  an  exact 
representation  of  g(x)  possible. 

We  now  make  the  following  assumptions  about  the  terms  in  (2.2): 


c  ~  N(0,«2)  i.i.d.  (2. 3a) 

0  ~  N( 0q/V)  (2.3b) 

nx  is  a  continuous  Gaussian  stochastic  process  (2.3c) 
nx  ~  N(°,Tq2R(x,x) )  (2.3d)  v 
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(  2 . 3e ) 


cov(nu,nv)  =  To2R<a,v). 

Assumption  (2.3a)  is  the  common  assumption  that  the  random  error 
terms  are  i.i.d.  normal  deviates  and  assumption  (2.3b)  states  that 
the  parameters  in  the  graduating  function  are  assumed,  a  priori,  to 
have  a  multivariate  normal  distribution.  Thus  assumptions  (2.3a,b) 
are  the  standard  assumptions  for  a  Bayesian  multiple  regression 
analysis*  A  special  case  that  has  been  afforded  special  attention 
(see  Lindley  and  smith  1972,  Blight  and  Ott  1975,  Steinberg  1984)  is 
when  the  regression  coefficients  are  assigned  a  vague  prior 
distribution.  Following  Lindley  and  Smith  (1972),  a  vague  prior  can 
be  obtained  from  (2.3b)  by  considering  limiting  forms  as  V-1  ♦  0. 

The  special  feature  of  (2.2)  is  the  introduction  of  the  "bias" 
function  nx  directly  into  the  statistical  model.  Assumptions 
(2.3c-e)  are  designed  to  provide  a  characterization  of  the  extent  of 
the  bias  and  justification  for  these  assumptions  is  made  by  a  direct 
appeal  to  prior  belief  (see,  for  example.  Blight  and  Ott  1975,  p. 
80).  First,  it  is  assumed  that  a  graduating  function  has  been 
chosen  which  captures  the  primary  features  of  the  dependence  of  Y 
on  x?  therefore,  it  is  reasonable  to  anticipate  that  the  bias  at 
any  given  point  will  be  0.  The  prior  variances  in  (2.3d)  express 
the  believed  extent  of  bias  throughout  the  explanatory  variable 
space.  Thus,  for  example,  if  the  graduating  function  is  thought  to 
provide  a  good  approximation  in  one  region  of  the  factor  space,  but 
may  be  increasingly  susceptible  to  bias  outside  that  region  (see, 
for  example.  Box  and  Draper  1959),  the  prior  variances  in  (2.3d)  can 
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be  defined  to  reflect  that  belief.  Finally,  the  prior  covariances 
in  (2.3e)  can  be  chosen  to  reflect  the  likely  similarity  of  the  bias 
at  proximate  locations  in  the  factor  space,  which  is  closely  related 
to  prior  beliefs  about  the  "smoothness”  of  the  true  response 
function.  The  particular  parameterization  for  the  covariance 
function  emphasizes  two  aspects:  (i)  it  is  natural  to  measure  bias, 

not  absoutely,  but  relative  to  the  magnitude  of  experimental  error, 

2 

so  a  has  been  factored  out  of  (2.3e);  and  (ii)  it  is  natural  to 
further  decompose  the  covariance  function  into  a  "standardized" 
covariance  function  R(u,v)  and  a  proportionality  constant  T  that 
indicates  the  overall  magnitude  of  the  bias,  relative  to  that  of 
experimental  error.  The  standardization  of  the  covariance  function 
might  be  accomplished  in  a  number  of  ways.  If,  for  example, 

Var{nx}  is  assumed  to  be  constant  (as  in  Smith  1973  and  Blight  and 
Ott  1975),  then  it  would  be  natural  to  let  R(u,v)  be  the 
corresponding  correlation  function.  Alternatively,  R(x,x)  might 
be  set  equal  to  1  for  some  particular  point  x;  then  T  would 
measure  the  relative  extent  of  bias  at  that  point.  Assumptions 
(2.3c-e)  generalize  those  made  by  Blight  and  Ott  (1975),  who 
proposed  a  particular  parametric  form  for  the  covariance  function 
which  they  felt  was  appropriate  for  polynomial  regression. 

Steinberg  (1984)  showed  how  (2.2)-(2.3)  is  related  to  models 
that  were  proposed  by  Smith  (1973)  and  O'Hagan  (1978)  and  to  the 
generalized  smoothing  spline  estimates  of  Wahba  (1978),  and  also 
showed  how  (2. 2)- (2. 3)  can  be  expressed  in  terms  of  a  generalized 
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Fourier  series  for  the  response  function.  We  briefly  indicate  the 


form  of  these  models  so  that  we  can  relate  the  results  of  later 
sections  to  them.  See  Steinberg  (1984)  for  further  details. 

Smith  (1973)  proposed  a  hierarchical  model  for  the  nxl  data 
vector  Y: 


Y/e 

~  N(0,O2I) 

(2.4a) 

e/e 

~  N(XB,T02R) 

(2.4b) 

0  ~ 

N(e0,V), 

(2.4c) 

where  0  is  the  vector  of  expected  values  of  the  observations,  X 
is  the  nxp  matrix  whose  ith  row  is  given  by  f* (x^),  and  R  is 
the  nxn  matrix  whose  i, jth  entry  is  Rtx^Xj).  O'Hagan  (1978) 
generalized  (2.1)  by  allowing  the  parameter  vector  0  to  be  a 
function  of  x  and  used  prior  distributional  assumptions  to  express 
the  believed  variability  of  P(x). 

Wahba  (1978)  proposed  estimating  g(x)  by  finding  that 
function  gn>T(x)  in  a  Hilbert  space  that  minimizes: 

n  2  1 

l  [y  -  h(x  )]z  +  T-  J(h),  (2.5) 

i=1 

where  J  is  a  semi-norm  on  the  Hilbert  space  that  is  typically  a 
roughness  penalty.  Wahba  proved  that  the  estimate  g  obtained 

n  ^  T 

from  (2.5)  is  also  the  solution  to  a  Bayesian  estimation  problem  and 
Steinberg  (1984)  showed  that  the  appropriate  Bayesian  model  is 
precisely  (2.3)  when  the  regression  coefficients  are  assigned  a 
vague  prior  distribution.  (Note  that  r  in  (2.5)  is  equal  to 
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in  Wahba’s  formulation.)  Wecker  and  Ansley  (1983)  also 


r 


(nX)"1 


studied  this  model. 

Steinberg  (1984)  showed  that  (2.3)  also  results  from 
consideration  of  a  Bayesian  model  for  a  generalized  Fourier  series 
expansion  of  g(x): 


g(x)  =  I  B.f.(x)  +  y  e.g.(x),  (2.6a) 

1  i  i’i 

j=1  J  J  i=0 

where  B  ~  N(f»0,V)  (2.6b) 

and  9^  ~  N(0,TO2m2),  independent.  (2.6c) 

Defining  nx  to  be  the  second  summation  in  (2.6a)  yields  (2.3)  with 


R(u,v) 


00 


gi(a)  g±(v). 


provided  the  above  series  converges  whenever  u  =  v.  Moreover, 
(2.3)  admits  such  an  expansion  whenever  the  factor  space  is  a 
compact  set. 


i 


3.  BAYESIAN  ESTIMATES  WITH  PROPER  PRIORS 


In  this  section  we  derive  estimates  of  the  response  function 
g(x)  and  the  regression  coefficients  based  on  the  Bayesian  model 
defined  by  (2.2)-(2.3)  of  Section  2.  Throughout  this  section,  we 
will  assume  that  the  regression  coefficients  have  been  assigned  a 
proper  prior  distribution;  estimation  with  an  improper  prior  will  be 
discussed  in  Section  4. 

Consider  first  the  problem  of  estimating  the  response 
function  g  at  the  point  x=(X^, . . . . /X^)  in  the  factor  space*  The 
Bayesian  model  presented  in  Section  2  represents  g(x)  as 
f'(x)8  +  nx  and  so  provides  a  prior  distribution  for  g(x).  A 
natural  way  to  estimate  g(x),  then,  is  to  find  its  posterior 
distribution;  i.e.,  to  find  the  conditional  distribution  of  g(x) 
given  the  observed  data*  Similarly,  estimates  of  the  vector  of 
regression  coefficients  should  be  based  on  the  posterior 
distribution  of  8*  To  find  these  distributions,  denote  the 
experimental  responses  by  the  nxl  random  vector  Y.  Then; 

Y  =  X8  +  n  +  e  (3.1a) 

g(x)  =  f' (x)8  +  hx  (3.1b) 

where  X  is  the  nxp  matrix  whose  ith  row  is  fMx^), 

n  =  (n, 

errors  for  the  observed  data.  Applying  (2.3),  simple  computations 
reveal  that  the  joint  distribution  of  ( Y* ,g(x) ,8' ) '  is 
multivariate  normal  with  expected  value:  ( (X8q) * ' (x)8q»8q' ) ' 
and  covariance  matrix; 


l1 


,  ,nx  )',  and  e  is  the  nxl  vector  of  experimental 

*n 


where  R  is  the  nxn  matrix  whose  i , jth  entry  is  RJx^Xj),  I  is 
the  nxn  identity  matrix,  and  r(x)  is  an  nxl  vector  whose  ith 
entry  is  R(x^,x). 

Since  the  joint  distribution  of  (Y* ,g(x),B‘)  is  multivariate 
normal,  the  posterior  distributions  of  g(x)  and  0  will  also  be 
normal.  Thus  the  natural  point  estimates  of  g(x)  and  0  based  on 
their  posterior  distributions  will  be  their  posterior 
expectations.  Certainly,  if  the  estimation  problem  is  placed  in  a 
decision-theoretic  framework,  the  posterior  expectations  will  be  the 
best  estimates  with  respect  to  all  symmetric  loss  functions.  We 
give  the  posterior  expectations  in  the  following  theorem. 

Theorem  3.1:  Suppose  that  the  model  is  specified  by  (2.2)-(2. 3)  and 
that  the  observed  data  are  Y*y.  Denoting  the  posterior  expectation 
of  g(x)  by  g(x),  we  have: 

A 

g(x)  -  E{g(x)/Y-y} 

-  f'(x)Bc  +  [T02r*  (x)  +  f'{x)W  ] 

2  2  -1 

x  (o  I  +  to  R  +  XVX*  )  (y  -  XB0>.  (3.2) 

The  posterior  expectation  of  B  is : 

2  2  “1 

E { B/T-y }  =  B0  +  VX'(0  I  +  to  R  +  XVX’)  (y  -  X0Q).  (3.3) 

Proof :  The  proof  follows  directly  from  standard  results  for 
conditional  expectations  of  multivariate  normal  distributions  (see, 
for  example,  Anderson  1958,  p.  28). 


9 


Although  Theorem  3.1  is  straightforward,  it  is  not  particularly 


revealing.  The  following  results  provide  more  intuition  into  the 
form  of  the  estimates.  We  begin  by  observing  how  (3.2)  and  (3.3) 
are  related. 

Corollary;  (i)  The  posterior  expectation  of  g(x)  has  a  natural 

decomposition  as  the  sum  of  a  graduating  function  whose  coefficients 

are  estimated  by  (3.3)  and  a  second  term,  which  Blight  and  Ott  call 

2 

the  correction  for  bias,  that  depends  on  r(x),  t,  and  o  : 

A 

g(x)  =  f'  (x)E{$/Y-y}  +  ElWY^y} ,  (3.4) 

where 

E{rix/Y«y}  -  to2r'  (x) (0*1  +  tcj2R  +  XVX*)_1(y  -  *P0>  (3.5) 

(ii)  The  bias  term  (3.5)  can  be  expressed  in  terms  of  the  residuals 

at  the  design  points  when  the  graduating  function  alone  is  fit. 

Denote  by  Y  the  vector  of  predictions  that  results  from  fitting 
GF 

A  A 

just  the  graduating  function;  Y  «  XE{B/Y«y},  and  denote  by  e 

GF  GF 

A  A 

the  corresponding  residual  vector;  e__=  y  -  Y  .  Then 

GF  GF 

E{n  /Y*y}  =  tr' (x)(I  +  TR)  *•  . 

X  GF 

Proof ;  Part  (ii)  of  the  corollary  is  trivial.  Part  (i)  can  be 
verified  directly,  using  the  same  approach  as  in  Theorem  3.1. 
Alternatively,  since  g(x)  *  f'(x)B  +  nx,  it  follows  that 
E { g( x)/Y“y}  *  e| f ' (x)$/Y«y}  +  Efn^/Y-y} 

*  f' (x)ElB/Y-y}  +  E{nx/Y-y} 

which  proves  (3.4)  and,  along  with  equations  (3.2)  and  (3.3), 
implies  (3.5). 

The  corollary  provides  valuable  insight  into  the  role  of  the 


« 
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2 

bias  covariance  function  to  R(n,v)  in  estimating  g(x). 
Substituting  this  expression  into  the  second  term  of  (3.4)  yields: 

*  2  n 

g(x)  »  f ' (x)e{ P/T*y}  +  to  l  a  R(x,x  ),  (3.6) 

i«1 

where  the  coefficients  a^  are  estimated  from  the  data.  Thus  the 
estimation  equation,  as  a  function  of  x,  combines  a  graduating 
function  with  a  linear  combination  of  n  functions  that  are 
completely  determined  by  the  form  of  the  covariance  function 
R(n,w)  and  the  choice  of  design  points. 

Equation  (3.6)  provides  a  broad,  flexible  class  of  estimators 
for  response  surfaces.  Moreover,  it  may  suggest  useful  guidelines 
for  choosing  a  prior  covariance  function,  since  some  choices  will 
lead  to  especially  appealing  estimation  equations  while  others  may 
have  undesirable  consequences.  For  example.  Blight  and  Ott  (1975) 
considered  the  special  case  of  univariate  polynomial  regression  and 
suggested  using  R(u,v)  “  V^,  0<A<1,  which  is  the  covariance 

function  for  a  first  order  autoregressive  process.  It  can  be  seen 

A 

from  (3.6)  that  this  choice  leads  to  an  estimate  g(x)  whose 
derivative  is  discontinuous  at  each  design  point.  If  it  is  believed 
that  the  response  function  has  a  continuous  derivative,  then  this 
prior  covariance  function  leads  to  a  poor  representation  of 
posterior  belief. 

Smith’s  (1973)  prior  specification  for  univariate  regression, 
whereby  R  ■  I,  is  also  called  into  question.  (Smith  made  prior 


assumptions  only  about  the  matrix  R,  not  about  the  complete 


covariance  function.)  In  general,  R  will  be  equal  to  the  identity 
only  if  R(u,v)  decreases  rapidly  as  |u  -  v|  increases,  in  which 
case  RfXjX^  will  have  a  rather  sharp  "spike"  around  x^.  From 
(3.6),  the  resulting  estimate  g(x)  will  deviate  from  the 
graduating  function  only  in  the  vicinity  of  the  design  points,  but 
these  deviations  may  be  quite  sharp.  Ibis  also  would  seem  to  be  an 
unlikely  summary  of  posterior  belief  about  the  nature  of  the 
response  function. 

O'Hagan  (1978)  also  considered  in  detail  the  case  of  univariate 
polynomial  regression  and  suggested  a  covariance  function  of  the 
form: 

R(u,v)  -  exp{-(u-v)2/d2}  x  f'(u)W(v), 
o 

where  d  is  a  scaling  parameter  and  W  is  a  positive  definite 

pxp  matrix.  This  choice  is  similar  to  Blight  and  Ott's,  but  there 

are  two  differences.  First,  R(u,u)  will  now  typically  be  an 

2k 

increasing  function  of  u  ,  where  k  is  the  degree  of  the 
polynomial,  instead  of  a  constant,  as  in  Blight  and  Ott's 
specification;  second,  R(u,v)  now  decreases  exponentially  in  the 
square  of  |u  -  v|.  An  important  consequence  of  the  latter  change 

A 

is  that  g(x)  will  now  be  an  analytic  function  of  x  instead  of  a 
function  with  a  discontinuous  first  derivative. 

Wahba  (1978)  devoted  particular  attention  to  univariate 
polynomial  regression  on  the  interval  (0,1)  and  recommended 
choosing  R  to  produce  polynomial  spline  estimates.  In  particular. 


for  the  popular  cubic  splines,  the  graduating  function  is  a  straight 


line  and: 


R(u,v) 


( 3u2v  -  u3)/6 
(3uv2  -  v3)/6 


if  u  <  v 


if  v  <  u. 


which  is  the  covariance  function  for  an  integrated  Brownian  motion 
on  the  unit  interval  (see  Shepp  1966).  Note  that  for  fixed  v, 
R(u,v)  is  itself  a  cubic  spline  as  a  function  of  u;  since  a  linear 
combination  of  cubic  splines  is  again  a  cubic  spline,  it  follows 
from  (3.6)  that  this  choice  of  R(u,v)  will  lead  to  an  estimate 

A 

g(x)  which  is  a  cubic  spline.  (Note,  however,  that  (3.6)  will  not 
give  Hahba's  cubic  spline  estimate  of  g(x),  which  results  only 
when  the  regression  coefficients  are  assigned  a  vague  prior;  the 
results  above  show  that  even  with  a  proper  prior,  it  is  possible  to 
obtain  an  estimate  that  is  a  cubic  spline . ) 

If  a  Bayesian  generalized  Fourier  series  representation  of 
g(x)  is  used  as  a  basis  to  derive  the  covariance  function,  then  the 

A 

estimate  g(x)  can  also  be  written  as  a  generalized  Fourier 
series.  In  particular,  if  the  experiment  is  modeled  by  (2.6a-c), 
then  substituting  (2.7)  into  (3.6)  yields: 


g(x)  -  f' (x)E{&/Y*yf  +  l  b  g  (x), 

j=0  3  3 


(3.7) 


2  V 

where  b  *  to  l  a 
3  i*1 

the  same  coefficients  that  appear  in  (3.6).  Moreover,  it  is  easy  to 


iVV' 


The 


in  the  last  expression  are 


show  that: 


bj  *  E{8j/X*y}. 

Thus,  if  the  generalized  Fourier  series  approach  is  employed,  g(x) 
is  a  series  of  exactly  the  same  form  as  g(x),  but  with  the 
coefficients  of  all  the  terms  in  the  series  estimated  by  their 
posterior  means. 


4.  BAYESIAN  ESTIMATES  WITH  IMPROPER  PRIORS 

The  authors  of  the  Bayesian  models  that  were  described  in 
Section  2  have  argued  that  it  is  often  appropriate  to  assume  an 
improper  prior  distribution  for  the  regression  coefficients  8.  In 
this  section,  we  explore  the  consequences  of  such  an  assumption  for 
estimating  the  response  surface  and  the  regression  parameters.  For 
discussion  of  the  significance  of  assigning  8  an  improper  prior, 
see  Steinberg  (1984,  Sect.  5.3). 

As  noted  in  Section  2,  a  natural  way  to  assign  8  an  improper 
prior  is  to  consider  limiting  forms  as  its  prior  precision  matrix 
V-1  converges  to  a  0  matrix:  i.e.,  to  retain  the  form  of  a  normal 
distribution  but  to  allow  the  prior  variance  of  that  distribution  to 
become  arbitrarily  large.  We  can  then  derive  the  posterior 
distributions  of  g(x)  and  8  by  taking  limits  of  the 
corresponding  posterior  distributions  when  a  proper  prior  is  used. 
Specifically,  we  will  derive  limiting  values  for  their  posterior 
means  (in  this  section)  and  variances  (in  Section  6)  and  will  show 
that  these  limits  exist  provided  that  the  X  matrix  has  full  column 


rank.  Since  the  posterior  distributions  are  normal  when  a  proper 
prior  is  used,  the  corresponding  posterior  densities  must  converge 
to  a  normal  density  with  the  limiting  mean  and  variance  as  its 
parameters.  Applying  a  theorem  of  Scheffe  (1947),  we  conclude  that 
the  limiting  posterior  distributions  are  also  normal.  As  such,  the 
posterior  means  will  be  the  natural  estimates  of  the  response 
function  and  the  regression  parameters  and  we  give  these  in  the 
following  theorem. 

Theorem  4. 1 :  Suppose  X  has  full  column  rank  and  an  improper  prior 
is  assigned  to  the  regression  coefficients.  Then  the  posterior  mean 
of  g(x)  is: 

A 

g(x)  -  lim  e{ g(x)/Y*y}  (4.1) 

+  o 

-  f'(x)[x«ll"1xj'1x-ll“1y  + 

T02r'(x>{lf 1-  *"1x[x-«"1xJ"1X-*"1}y, 
where  M  »  (I  +  TR) .  The  estimated  regression  coefficients  are: 

lim  E|B/T-y}  -  [x'*"1xj”1X,*"1y  (4.2) 

v-1-*.  0 

Before  proceeding  with  the  proof,  we  note  that  this  result  is  by  no 
means  new;  in  fact.  Theorem  4.1  is  only  a  slight  generalization  of 
Theorem  2  in  Wahba  (1978)  and  Hieorem  2  in  O'Hagan  (1978),  both  of 
which  assume  that  V  *  kl. 

Proof :  The  proof  relies  on  two  simple  matrix  identities,  which  we 
state  as  lemmas.  In  both  of  the  lemmas,  we  assume  that  the 
dimensions  of  the  matrices  are  such  that  the  indicated  matrix 
operations  are  well-defined  and  that  all  the  necessary  matrix 


inverses  exist 


Lemma  It 


(4.3) 


(A  +  XVX'  )_1  -  A-1  -  A_1x(  A*A_1X  +  v"1)"1*^”1. 

-1  -1  -1  -1  -1 

Lemma  2:  VX'  (A  +  XVX')  -  (X'A  X  V  )  X‘A  (4.4) 

Both  of  the  lemmas  can  be  verified  directly;  an  interesting 
statistical  proof  of  Lemma  1  was  given  by  Lindley  and  Smith  (1972) 
using  properties  of  a  hierarchical  Bayesian  linear  model.  The  proof 
of  the  main  result  follows  directly  upon  using  the  lemmas  to  rewrite 
(3.2)  and  (3.3)  with  A  *  a2*  «  o2(I  +  TR) ,  noting  that: 

(X'A_1X  +  V-1)"1  ♦  <J2(X'*"1X>“1  as  V"1  ♦  0 
provided  that  X  has  full  column  rank.  Details  are  given  in  the 
Appendix.  Whereas  earlier  proofs  required  several  complicated 
limits,  only  the  simple  limiting  result  above  is  needed  here. 

As  with  the  estimates  (3.1)  and  (3.2)  (when  proper  priors  are 
used ) ,  (4.1)  can  be  decomposed  into  the  sum  of  a  graduating  function 
whose  coefficients  are  estimated  by  (4.2)  and  a  "correction  for 
bias"  term: 

lim  E{g(x)/X*y}  -  f  {*)  li®  eI  P/lVy} 
v“'+  0  v  +  0 

+  lim  E{n  /T«y} ,  (4.5) 

V  ♦  0  x 

where  the  latter  term  is  given  by  the  final  line  of  (4.1).  Also, 
the  estimated  response  function  can  once  again  be  written,  as  in 
(3.6),  as  the  sum  of  a  graduating  function  and  n  functions  whose 
form  is  completely  determined  by  the  covariance  function  (2.3e): 
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n 

lim  E{g(x)/Y*y}  *  f  (x)  lii^  E{&/Y«y}  +  t  [  a^lx.x  ), 

V  ♦  0  V  ♦  0  i-1 

(4.6) 

where  the  coefficients  a^  are  estimated  from  the  data.  Thus  all 
the  comments  of  the  previous  section  on  the  role  of  the  covariance 
function  R(u,w)  in  the  estimated  response  function  are  also  valid 
if  the  regression  coefficients  are  assigned  an  improper  prior.  In 
particular,  if  the  covariance  function  is  derived  using  a 
generalized  Fourier  series  model,  the  estimate  will  be  a  generalized 
Fourier  series  of  the  same  form  whose  coefficients  will  be  estimated 
bys  lim  e{8  /Y-y}. 

v"1*  0  3 


There  are  two  important  differences  in  the  estimated  response 

function  that  result  from  assuming  a  vague  prior  distribution  for 

the  regression  coefficients.  First,  the  estimation  equation  no 

longer  depends  on  the  prior  mean  for  these  coefficients.  The 

regression  coefficients  are  now  estimated  on  the  basis  of  the  data 

2 

alone.  Second,  the  estimation  equation  is  independent  of  o  ,  but 
does  depend  on  the  parameter  t  which  expresses  the  overall 
magnitude  of  bias  relative  to  that  of  experimental  error. 

The  estimate  (4.2)  for  the  regresf ion  coefficients  is  clearly 
the  generalized  least  squares  estimate  that  corresponds  to  the 
model  Y  *  X®  +  e  in  which  the  error  terms  are  correlated  with 
E{ee'}  ■  o^(I  +  TR),  and  is  the  maximum  likelihood  estimate  for 
this  model  if  it  is  further  assumed  that  the  errors  have  a 
multivariate  normal  distribution.  These  facts  suggest  a 


17 


justification  for  (4.2)  as  an  estimator  of  0  in  classical  sampling 
theory  terms.  If  the  proposed  regression  model  is  only  an 
approximation  to  the  true  response  function,  then  deviations  from 
that  model  will  be  the  sum  of  two  components,  one  due  to 
experimental  error  and  the  other  due  to  bias.  It  is  common  to 
assume  that  experimental  errors  are  independent  of  one  another,  but 
such  an  assumption  seems  quite  implausible  for  errors  due  to  bias. 
Thus  the  resulting  model  should  involve  correlated  error  terms  in 
which  the  correlations  reflect  the  extent  of  the  bias,  just  as  in 
the  above  model.  Of  course,  such  an  argument  does  not  suggest  what 
the  precise  form  of  the  error  covariance  matrix  should  be.  (One 
possibility  might  be  to  assume  a  simple  parametric  form  for  the 
covariance  matrix  and  then  to  maximize  the  resulting  likelihood 
function  over  all  the  parameters.)  Also,  the  frequentist  approach 
would  lead  to  an  estimator  that  includes  only  the  first  term  of 
(4.1),  the  best  fitting  graduating  function  in  light  of  experimental 
error  and  bias;  the  second  term,  which  gives  the  adjustment  for 
bias,  arises  only  in  the  Bayesian  context. 

Allowing  T  to  range  from  0  to  infinity  permits  us  to  model  a 
wide  range  of  situations,  from  those  in  which  experimental  error  is 
dominant  (when,  say,  scientific  knowledge  provides  an  exact  form  for 
the  response  function)  through  those  in  which  the  bias  is  dominant 
(as  might  be  the  case  in  numerical  analysis).  We  now  consider  the 
estimates  that  result  in  these  limiting  cases. 

Theorem  4.2:  Given  the  model  (2.2)-(2.3)  with  an  improper  prior 


assigned  to  the  regression  coefficients,  the  estimates  (4.1)  and 

(4.2)  have  the  following  limiting  forms  as  T  ♦  0: 

lim  lim  e{  g(x)/Y=y}  =  f'(*){I'I)‘'ry.  (4.7) 

T+0  V  ♦  0 

lim  lii^  E{0/T-y}  -  (X'*)“1X'y.  (4.8) 

T+0  V  -*•  0 

If  R  is  non-singular,  then  the  estimates  (4.1)  and  (4.2)  have  the 

following  limiting  forms  as  T  *  ®: 

lim  liif  E{g(x)/Y=y}  =  f‘  (x)  (X'r”1*)""1*'*”^  +  (4.9) 

T-***  V  ♦  0 

r'  (*){r_1-  R-1X(X,R-1*)~1X,R-1}y. 
lim  lisy  E{e/Y-y}  =  (Y'R"'1X)“1X,R-1y  (4.10) 

T+«  V  >  0 

Proofs  Equations  (4.7)  and  (4.8)  follow  directly  from  (4.1)  and 
(4.2)  upon  noting  that  (I  +  TR)  ^  ♦  I  as  T*0.  To  obtain 
equations  (4.9)  and  (4.10),  we  rewrite  (4.1)  and  (4.2),  replacing 
(I  +  TR)-1  by  t-1(t-1I  +  R)-1.  After  some  cancellation,  the 
expressions  now  depend  on  T  only  through  (t-1I  +  R)-1,  which 
converges  to  R  1  as  T  ♦  •»,  provided  that  R  is  non-singular, 
and  results  in  (4.9)  and  (4.10). 

The  first  half  of  Theorem  4.2  yields  familiar  answers:  the 
ordinary  least  squares  estimators.  Thus,  ordinary  least  squares 
obtains  as  a  limiting  case  of  the  Bayesian  model  when  the  regression 
parameters  have  an  improper  prior  and  when  the  bias  is  assumed  to  be 
negligible  relative  to  experimental  error  (i.e.  when  the  graduating 
function  is  assumed  to  exactly  represent  the  response  function). 

The  second  half  of  Theorem  4.2  does  not  have  so  immediate  an 


interpretation,  but  the  following  section  will  clarify  what  happens 
when  T  tends  to  infinity. 


1 

I 


5.  SPECIAL  FORMS  FOR  THE  Y  VECTOR 
In  this  section  we  derive  some  results  describing  the  vector 

a  a 

Y  of  precicted  values  whose  ith  entry  is  g(x^),  where  x^  is 
the  ith  design  point.  We  begin  by  examining  the  relationship 

A 

between  Y  and  T ,  the  presumed  extent  of  bias  relative  to 
experimental  error,  when  the  regression  coefficients  are  assigned  an 
improper  prior. 

Theorem  5.1;  Suppose,  given  the  model  (2.2)-(2.3),  that  X  has 
full  column  rank  and  that  R  =  (r(x^,x.j))  is  non-singular.  Then: 

A 

lim  lit^  Y  »  y?  (5.1) 

T-m»  V-'-*-  0 

i.e.,  the  estimation  equation  interpolates  the  observed  data. 
Replicate  observations  at  any  design  points  will  contribute 
identical  rows  and  columns  to  R,  making  it  singular.  Suppose, 
however,  that  elimination  of  all  identical  rows  and  columns  yields  a 
non-singular  matrix.  Then: 

A 

Y^  =  average  of  all  observations  at  x^ .  (5.2) 

Proof :  Noting  that  the  ith  row  of  X  is  f ' ( x^ ) ,  and  that  the 
ith  row  of  R  is  r'(x^),  (5.1)  is  an  immediate  consequence  of 
(4.9).  To  obtain  (5.2),  observe  that  if  there  are  replicate 
observations  at  X^,  all  the  information  they  provide  about  g(x) 
is  contained  in  their  average.  Thus  we  can  compute  (3.2),  (4.1), 
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and  (4.9)  conditioning  only  on  the  vector  of  replicate  averages, 
rather  than  the  entire  data  vector,  which  leads  to  (5.2).  Details 
are  given  in  the  Appendix. 

Combining  the  results  of  Theorem  4.2  and  Theorem  5.1  provides 
considerable  insight  into  how  the  estimated  response  function  g(x) 
behaves  as  a  function  of  T:  it  varies  from  the  least  squares 
graduating  function  (when  i  =  0)  to  an  interpolant  (when  t  +  °°). 

As  an  intuitive  justification  for  the  latter  result,  we  might  think 
of  T  +  “  as  an  appropriate  way  to  model  data  that  are  not  subject 
to  experimental  error,  so  that  the  observed  responses  are  exact 
values  of  the  response  function.  The  estimated  response  function 
reflects  this  certain  knowledge  by  correctly  estimating  the  response 
at  those  points. 

The  above  description  of  how  the  estimated  response  function 

depends  on  T  is  well-known  in  the  spline  literature  when  all  the 

design  points  are  distinct  (see,  for  example,  Kimeldorf  and  Wahba 

1971),  but  has  generally  not  been  extended  to  cover  replicates  nor 

has  it  been  observed  in  connection  to  the  Bayesian  models.  Blight 

and  Ott,  for  example,  proposed  a  parametric  form  for  the  covariance 

function  R  and  then  suggested  that  these  parameters  and  t  be 

jointly  estimated  by  minimizing  the  residual  sum  of  squares, 

*  2 

S(H,t)  =  ^  (y ^  )  .  It  is  clear  from  Theorem  5.1  that  S(R,T) 

will  always  be  minimized  when  t  regardless  of  the  values  of 

the  other  parameters. 

A  common  property  of  Bayes  estimates  using  proper  priors  is 


that  they  can  be  expressed  as  weighted  averages  of  their  prior  means 


and  the  observed  data.  This  is  not  possible  for  g(x)  for  an 
arbitrary  point  x  because,  in  general,  no  observation  has  been 
made  there;  for  the  n  observed  data  points,  however,  and  for  the 
estimated  regression  coefficients,  we  can  write  such  weighted 
averages . 

Theorem  5.2:  Given  the  model  (2.2)-(2.3),  the  predicted  value 
vector  Y  can  be  expressed  as  a  weighted  average  of  its  prior 
mean  x6q  and  the  observed  response  vector  y,  where  the  weights 

are  inversely  proportional  to  the  respective  measures  of 

2  _  2 

variation,  to  R  +  XVX'  and  o  : 

y  =  [o_2i  +  (to2r  +  xvx*  r1]-1 

x  [o“2y  +  <to2R  +  XVX' )-1XB0].  (5.3) 

Similarly,  the  estimated  regression  coefficients  can  be  written  as  a 
weighted  average  of  their  prior  mean  8g  and  the  observed 
responses,  with  the  weights  inversely  proportional  to  the  prior  and 
data  covariance  matrices,  respectively: 

E{e/Y=y}  =  [a_2X'(I  +  TR)_1X  +  v" 1 ] “ 1 

x  [a  x'(l  +  TR)  y  +  V  80J.  (5.4) 

Proof :  We  exploit  here  the  hierarchical  model  formulation  (2.4a-c) 
proposed  by  Smith  (1973).  In  this  formulation,  e{y/0(  =  8,  so 

A 

that  Y  =  E{fl/Y=y}.  Equation  (5.3)  is  then  a  special  case  of  a 
theorem  proved  by  Lindley  and  smith  (1972,  equations  12  and  13).  To 
prove  (5.4),  we  combine  (2.4a)  and  (2.4b)  to  obtain  the  distribution 
of  Y  conditional  on  fl  without  the  mediating  parameter  9: 
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Y/3  ~  N(X$,o2(i  +  tr)). 

This,  together  with  (2.4c)  matches  the  assumptions  of  a  lemma  proved 
by  Lindley  and  Smith  (1972);  their  equations  (7)  and  (8)  imply 
(5.4). 

Steinberg  (1984),  exploiting  results  of  Wahba  (1978),  showed 
that  the  Bayesian  model  considered  here  leads  to  generalized  spline 
estimates  when  the  regression  coefficients  are  assigned  an  improper 
prior.  The  underlying  motivation  for  spline  estimates  is  to  find  a 
reasonably  "smooth"  function  that  closely  follows  the  observed 
data.  Spline  estimates  are  derived  as  solutions  to  the  minimization 
problem  (2.5)  stated  in  Section  2.  In  the  following  theorem,  we 

A 

show  that  Y  is  the  solution  to  a  discrete  analogue  of  (2.5). 

A 

Theorem  5.3:  Y  solves  the  minimization  problem:  find  u  to 
minimize 

<u-y)'(w-y)  +  (u-X80)'o2(to2R  +  XVX* )-1 (m-*»0) .  (5.5) 

As  V  1  ♦  0,  (5.5)  tends  to: 

(u-y)'(u-y)  +  t"1u'[r"1-  R~1X(X'R~1X)"1X'R_1]u.  (5.6) 

Moreover,  the  second  term  in  (5.6)  is  0  if  and  only  if  u  e  col(X); 
that  is,  if  and  only  if  u  can  be  written  as  a  linear  combination  of 
the  columns  of  X. 

Proof :  The  proof  is  given  in  the  Appendix. 

A 

Both  (5.5)  and  (5.6)  describe  the  vector  Y  of  predicted 
values  as  the  solution  to  a  minimization  problem  composed  of  two 

A  A 

terms:  the  residual  sum  of  squares,  (Y-y)'(T-y),  and  a  quadratic 
penalty  term.  For  the  general  case  (5.5),  the  quadratic  term 
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penalizes  Y  in  accord  with  its  distance  from  its  prior 

expectation,  This  has  the  effect  of  "shrinking"  the  vector  of 

predicted  values  toward  its  prior  expectation.  The  extent  of  the 

2  2  _ —  1 

shrinkage  depends  on  the  weighting  matrix  a  (to  R  +  XVX* )  ,  which 

2 

is  proportional  to  a  ,  but  is  inversely  proportional  to  the  prior 
2  _ 

variance  to  R  +  XVX*.  Thus  the  prior  expectation  will  be  most 

influential  when  our  prior  precision  is  great  relative  to 

experimental  error;  when  our  prior  precision  is  not  great,  the  data 

will  dominate  the  prior  in  determining  Y. 

The  quadratic  penalty  term  undergoes  several  interesting 

changes  in  the  limiting  case  of  (5.6).  First,  the  penalty  depends 

2 

on  the  variance-bias  tradeoff  parameter  r  but  not  on  a  .  Second, 
the  penalty  is  independent  of  the  prior  expectation  Xflg.  Third, 
the  penalty  is  0  only  for  those  vectors  of  predicted  values  which 
are  in  the  column  space  of  X;  i.e.,  for  those  vectors  of  predicted 
values  which  can  be  exactly  written  as  a  graduating  function.  The 
meaning  of  these  last  two  points  is  that  the  penalty  no  longer 
induces  shrinkage  toward  a  particular,  pre-specif ied  vector  of 
predicted  values;  rather,  in  a  more  general  sense,  there  is 
shrinkage  toward  the  response  plane  spanned  by  the  graduating 
function.  Finally,  equation  (5.6)  is  an  exact  discrete  analogue  of 
(2.5),  the  continuous  smoothing  problem  that  leads  to  generalized 
spline  estimation.  Thus  Theorem  5.3  further  illustrates  the  close 
link  between  spline  estimation  and  the  Bayesian  models  when  a 
diffuse  prior  is  assigned  to  the  regression  coefficients. 
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Theorem  5.3  can  also  be  used  to  show  how  the  residual  sum  of 


squares  depends  on  the  choice  of  t .  Let  us  denote  the  vector  of 
predicted  values  by  Y(r)  to  emphasize  its  dependence  on  t.  Then 
define  the  residual  sum  of  squares  function  by 

A  A 

RSS(t)  =  [y  -  T( t ) ] 1 [y  -  Y(T )] . 

Corollary:  If  X  has  full  column  rank  and  the  regression 
coefficients  are  assigned  an  improper  prior,  then  RSS(t)  is  a 
monotone  decreasing  function  of  t,  with  RSS(O)  equal  to  the 
residual  sum  of  squares  from  fitting  the  graduating  function  by 
ordinary  least  squares.  If,  in  addition,  R  is  non-singular, 
then  RSS (“)  =  0. 


| 


#  i 
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6.  PRECISION  OF  THE  ESTIMATES 


Measures  of  precision  for  the  Bayesian  model  described  in 
Section  2  can  be  obtained  in  exactly  the  same  fashion  as  the 
estimates  derived  in  the  preceding  sections.  Since  the  posterior 
distributions  of  g(x)  and  of  0  are  normal,  the  natural  measures 
of  precision  are  the  corresponding  posterior  variances,  which  we 
give  in  the  following  theorem. 

Theorem  6. 1  Given  the  model  (2.2)-(2.3),  the  posterior  variance 
of  g(x)  is: 

•y 

Varfg(x)/Y*y}  -  to  R(x,x)  +  f ' (x)VT(x) 

-  [T02r*(x)  +  f' (x)VX' ] [o2I  +  to2R  +  EVE']-1 

*y 

X  [to  r{x)  +  EVf ( x)  ] .  (6.1) 

The  posterior  variance  matrix  for  S  is: 

Var{B/Y«y}  -  V  -  VX' (a  I  +  to  R  +  EVE')  EV.  (6.2) 

Proof:  The  proof,  like  that  of  Theorem  3. 1,  follows  from  standard 
properties  of  multivariate  normal  vectors. 

The  posterior  variance  of  g(x),  like  the  posterior  mean,  can 
be  decomposed  into  several  terms.  Following  equation  (2.2), 
Var{g(x)/*«y[  *  Var(f'{x)3  +  n^/T-y} 

m  f' (x)Var{B/TTr}f(x)  +  Varln^/Tr} 

+  2f ' (x)Cov{B,nx/Y«y} .  (6.3) 

Once  again,  it  is  particularly  interesting  to  study  the  special 
case  when  the  regression  coefficients  are  assigned  an  improper 
prior.  As  we  noted  in  Section  4,  the  posterior  distributions  are 
still  normal,  provided  the  X  matrix  has  full  column  rank.  In  the 
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following  theorem,  we  derive  the  posterior  variances. 


Theorem  6.2:  Suppose  X  has  full  column  rank  and  an  improper  prior 
is  assigned  to  the  regression  coefficients  of  (2.2)-(2.3).  Then  the 
posterior  variance  of  g(x)  is: 

lim  Var{g(x)/Y=y}  -  <j2{tR(x,x)  +  f' (x)(X'M  \)  ^(x) 

v~\  o 

-  2-rr'  (x)*"1X(X'Il"1X)"1f(x) 

-  t^r*  (x) [it""1-  M"1X(X,M-1X)“1X,ll_1]r{x)}, 

(6.4) 

where  M  «  I  +  tr.  The  posterior  variance  matrix  of  0  is: 

lim  Var{0/Y«yf  =  a2(X,*"1X)"1 .  (6.5) 

V  +  0 

Proof :  Hie  proof  is  given  in  the  Appendix. 

When  the  regression  coefficients  are  assigned  an  improper 

prior,  we  found  in  Section  4  that  the  posterior  means  are 
2 

independent  of  o  ;  we  see  above  that  the  posterior  variances  are 

o 

proportional  to  c  with  a  constant  of  proportionality  that  is  a 
function  of  T,  the  covariance  function  R(o,y),  the  experimental 
design  and,  in  the  case  of  g(x),  the  point  x.  The  posterior 
variance  matrix  for  0  is  the  same  matrix  that  would  result  from  a 
classical  sampling  theory  model  in  which  the  error  terms  are 
correlated  with  E{ee'}  *  a2 (I  +  TR).  Although  the  posterior 
variances  do  depend  on  the  experimental  design,  they  are  independent 
of  the  observed  responses.  For  the  special  case  when  R  is 
non-singular,  we  note  that  Wahba  (1983,  Theorem  2)  gives  an 
alternative  expression  for  (6.4)  .  The  equivalence  of  her  formula 

i 

j 


and  (6.4)  can  be  verified  by  straightforward,  but  tedious,  algebra. 

It  is  quite  difficult  to  describe  precisely  how  the  posterior 
variances  behave  as  functions  of  the  bias  covariance  parameters,  the 
design,  and  the  estimation  site.  Some  general  conclusions,  however, 
can  be  reached.  The  following  theorem  describes  how  the  posterior 
variances  depend  on  T . 

Theorem  6.3:  Assume  the  data  are  modeled  by  (2.2)-(2.3).  Then  the 
following  conclusions  hold: 

(i)  The  posterior  variance  of  g(x)  is  a  monotone  increasing 
function  of  t . 

(ii)  The  posterior  variance  of  g(x)  obtains  a  minimum  value  of 

02f'(x)(X’X  +  ffW'flx) 

when  T  =  0.  If  the  regression  coefficients  are  assigned  an 
improper  prior,  the  minimum  value  is 

02f,<x)<X’X)"1f(x). 

(iii)  If  x  is  a  design  point,  then: 

Var{g(x)/T=y}  <  o  . 

(iv)  Suppose  the  design  includes  m  distinct  points,  Xj,....,^. 

Denote  by  Y  the  mxl  vector  of  average  responses  at  the  distinct 

design  points;  i.e.,  =  average  of  observations  at  x^.  Denote 

-  2~ 

by  to  R  the  bias  covariance  matrix  of  Y  and  denote  by  to  Rjj 
the  bias  covariance  matrix  of  (5’,g(x)).  If  both  R  and  5^  are 
non-singular,  then  the  posterior  variance  of  g(x)  diverges  to 
infinity  as  t  ♦  ». 

(v)  The  posterior  variance  of  B  is  also  a  monotone  increasing 
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function  of  T;  i.e.,  if  we  denote  <6. 2)  by  D(T)  to  emphasize  its 
dependence  on  t,  then  D(t2)  -  DC^)  is  a  positive  semi-definite 
matrix  whenever  Tj  > 

(vi)  The  minimum  posterior  variance  of  B  is  attained  when  t  *  0 
and  is 

cr2(X'X  +  ffV1)'1. 

If  the  regression  coefficients  are  assigned  an  improper  prior,  the 
minimum  value  is 

a2(X'X)_1. 

Proof:  The  proof  is  given  in  the  Appendix. 

The  monotonicity  property  proved  in  Theorem  6. 3  is  intuitively 
appealing.  If  we  recall  that  T  reflects  the  suspected  extent  of 
bias  relative  to  experimental  error,  then  increasing  t  (all  else 
fixed)  corresponds  to  positing  a  model  in  which  the  effect  of  bias 
is  more  severe.  As  we  might  expect,  such  an  assumption  leads  to  a 
degradation  in  the  precision  of  the  estimates.  In  particular. 
Theorem  6.3  allows  us  to  compare  models  that  include  a  bias  term 
(T  >  0)  with  models  that  include  only  a  graduating  function 
(t  *  0).  The  (tentative)  assumption  that  a  particular  graduating 
function  exactly  describes  the  response  function  must  lead  to  a  more 
optimistic  assessment  of  the  pre^ 1 sion  of  the  estimates  than  would 
be  obtained  if  possible  bias  were  ex  >licitly  included  in  the 
model.  In  particular,  the  i^atio  i  variances  provided  by  ordinary 
least  squares  (OLS)  regression  analysis  are  identical  to  the 
posterior  variances  when  t  ■  0  and  the  regression  coefficients  are 


assigned  an  improper  prior.  Hi  us  OLS  leads  to  a  "best  case" 
assessment  of  precision  that  may  be  unduly  optimistic. 

Another  interesting  conclusion  from  Theorem  6.3  concerns  the 

contrasting  behavior  of  the  posterior  variance  of  g(x)  at  design 

2 

points  and  at  other  points.  Whereas  the  former  is  bounded  by  a  , 
the  latter  may  diverge  to  infinity  as  T  ♦  •.  The  key  term  in 
determining  the  limiting  behavior  of  Var{g(x)/Y=y|  is 
Varfhjj/TTr} /  the  posterior  variance  of  the  bias  term  at  x,  which 
depends  on  R(m,w)  and  the  experimental  design  used.  If  the 
posterior  variance  of  the  bias  is  positive,  then  Var{g(x)/Y=y} 
will  diverge  to  infinity  as  t  ♦  “.  In  general,  the  only  way  to 
assure  a  minimal  level  of  precision  for  estimating  g(x)  is  to  take 
an  observation  at  x.  This  contrasts  markedly  with  the  implication, 
when  it  is  assumed  that  no  bias  is  present,  that  maximal  precision 
for  estimating  g(x)  can  sometimes  be  obtained  by  taking 
observations  far  away  from  x. 

Useful  information  about  the  form  of  the  posterior  variances 
can  be  obtained  by  considering  the  vector  of  estimated  values  at 
the  n  design  points.  The  basic  results  are  given  in  the  following 
theorem,  which  is  also  found  in  Wahba  (1983)  for  the  special  case 
when  R  is  non-singular  and  the  regression  coefficients  are 
assigned  an  improper  prior. 

Theorem  6.4:  Define  0  as  in  (2.4)  to  be  the  vector  of  expected 
values  at  the  design  points.  The  posterior  variance  matrix  of  0 
is : 
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(6.6) 


0  A.  0  o  •  1 

Var{6/r»y}  ■  or  i  -  a  (a  I  +  xo  R  +  XVX' ) 

(6.6)  is  monotone  increasing  in  t  (i.e.,  x2  >  x^  implies  that 
the  difference  of  the  respective  variance  matrices  is  positive 
semi-definite)  and  achieves  a  minimum  of 

o2X(X*X  +  a2v"1r1X' 

2 

when  x  ■  0.  If  R  is  non-singular,  (6.6)  converges  to  o  I  as 
x  +  •»,  If  the  regression  coefficients  are  assigned  an  improper 
prior  distribution,  then  the  limiting  posterior  covariance  matrix 
is: 

lim  Var{8/Y-y}  -  <J2{l  -  h"1  +  . 

V  >  0 

(6.7) 


(6.7)  is  also  monotone  increasing  in  x  and  achieves  a  minimum  of 

o2X(X'X)~1X'  when  x  »  0.  If  R  is  non-singular,  (6.7)  also 
2 

converges  to  a  I  as  x  ♦  *. 

Proof:  The  proof  is  given  in  the  Appendix. 

Several  conclusions  can  be  drawn  from  the  limiting  forms  in 
Theorem  6.4.  If  x  -  0,  the  Bayesian  model  produces  the  familiar 
formulas  for  Bayesian  multiple  regression  with  a  proper  prior  (6.6) 
or  for  OLS  regression  (6.7)  when  an  improper  prior  is  employed.  At 
the  other  extreme,  as  x  ♦  R  non-singular  implies  that  the 
posterior  variance  matrix  of  8  is  a  I.  One  possible 
interpretation  of  this  result  is  that  x  +  «•  corresponds  to  a 
situation  in  which  the  experimenter  believes  that  the  response 
function  is  so  erratic  that  observations  at  different  points  are 
essentially  samples  from  independent,  unrelated  distributions.  Thus 
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only  observations  made  at  x  provide  any  basis  for  inference 


f 


about  g(x)  and  the  posterior  variance  of  g(x)  at  each  design 

2 

point  is  equal  to  a  ,  the  variance  of  the  observation  made  there. 

7.  ESTIMATION  OF  VARIANCE  PARAMETERS 

Thus  far  we  have  presented  results  for  the  Bayesian  model  of 

2 

Section  2  that  depend  on  the  experimental  error  variance  o  ,  the 
parameter  T  that  reflects  the  magnitude  of  bias  relative  to 
experimental  error,  and  the  form  of  the  bias  covariance  function, 
R(u,w).  In  this  section  we  consider  various  methods  that  might  be 
used  to  make  inferences  about  these  components  of  the  model.  The 
discussion  here  will  be  rather  general?  we  hope  to  present  more 
exact  results  in  a  later  paper. 

7.1  Estimating  a2 

Ideally  we  would  like  to  use  replicate  observations  to 
2 

estimate  o  .  Just  as  in  ordinary  least  squares  regression, 

replicate  observations  allow  us  to  form  a  "pure  error"  sum  of 

2 

squares  that  provides  an  estimate  of  o  independent  of  any 
assumptions  about  the  nature  of  the  response  function.  Bayes 

2 

estimates  can  be  obtained  by  assuming  a  prior  distribution  for  a 

and  combining  the  prior  with  the  information  in  the  replicates. 

If  there  are  no  replicates  in  the  experimental  design, 

2 

estimates  of  o  can  be  based  on  the  vector  of  residuals 

A 

e  »  y  -  T  from  the  estimated  response  function.  Wahba  (1983),  in 


the  context  of  cubic  spline  regression,  suggested  the  estimate: 

A2 

a  =  e'e/tr(B) ,  (7.1) 

where  B  is  the  matrix  that  maps  y  into  e  and  tr  denotes 

matrix  trace.  If  t  =  0,  (7.1)  reduces  to  the  conventional  least 

2 

squares  regression  estimate  of  a  .  The  rationale  for  dividing  by 

tr(B)  is  that  Varje}  =  o2B. 

2 

Bayes  estimates  of  a  can  be  derived  by  assuming  an 

2 

appropriate  prior  distribution  for  a  .  As  one  possibility,  suppose 

2 

we  adopt  a  uniform  prior  for  a  .  Then  the  posterior  density  of 
2 

a  under  (2.2)— (2.3),  up  to  a  proportionality  constant,  is: 


exp{-(T-*3 rfo2!  +  TC^R  +  XV*’]_1(T-18  )/l\ 


p(<J  /T)  « 


{ det ( 0  I  +  TO  R  +  XVX’ )} 


1/2 


(7.2) 


Since  (7.2)  results  in  a  rather  complicated  posterior  distribution 
2 

for  a  ,  we  might,  following  Lindley  and  Smith  (1972),  estimate 
2 

a  by  the  mode  of  the  posterior  density.  Differentiating  (7.2) 

2 

with  respect  to  o  and  equating  to  0  yields  the  equation: 

(T-«0)'C-1(I  +  TR)C_1(Y-180)  -  tr[  (I  +  TR)C_1]  =  0,  (7.3) 

2  2  _ 

where  C  *  a  I  +  to  R  +  XVX1.  In  general,  (7.3)  does  not  provide  a 

2 

closed  form  solution  for  o  .  We  can,  however,  state  a  closed  form 

for  the  special  case  in  which  the  regression  coefficients  are 

assigned  an  improper  prior.  As  in  Section  4,  we  consider  the  limit 

of  the  left  hand  side  of  (7.3)  as  V  ^  converges  to  a  0  matrix. 

Using  Lemma  1  of  Section  4,  it  is  easily  verified  that 
- 1  -2  - 1 

C  +  CJ  B  as  V  +  0,  where  B  is  the  matrix  referred  to  above 


that  maps  y  into  the  residual  vector  e.  Further,  (I  +  tR)B  is 

an  idempotent  matrix  that  projects  into  the  orthogonal  complement  of 

the  column  space  of  X,  so  that  the  second  term  in  (7.3)  converges 

to  (n  -  p)/0  .  Thus,  the  posterior  modal  estimate  when  the 

regression  coefficients  are  assigned  an  improper  prior  is: 

a2  =  e'(I  +  TR)e  /  (n  -  p).  (7.4) 

The  estimate  (7.4)  is  quite  similar  to  Wahba's  estimate  (7.1);  (7.4) 

modifies  (7.1)  by  including  (I  +  TR)  in  both  the  numerator  and  the 

denominator  to  obtain  a  weighted  sum  of  squares  divided  by  a 

constant  divisor.  In  addition,  (7.4)  is  precisely  the  estimate  of 
2 

o  that  would  result  from  a  conventional  generalized  least  squares 

analysis  of  the  regression  model  Y  =  X8  +  e  when  the  errors  are 

correlated  with  E{ee'}  =  o2(X  +  tr). 

Another  possibility,  again  suggested  by  Lindley  and  Smith 

2 

(1972),  is  to  estimate  0  by  the  mode  of  the  joint  posterior  of 
2 

O  and  8,  rather  than  the  mode  of  the  marginal  posterior  of 

2  2  2 
0  .  If  we  denote  the  prior  densities  of  0  and  8  by  p(o  ) 

and  p(8),  respectively,  then  their  joint  posterior  will  be: 


P(o2,B/Y) 


p(o2)p(8)  exp|-(T-x8)’ (I+tR)_1(Y-X8)/202} 
fdet(I+TR) } 1//2  0n 

(7.5) 


2 

For  each  fixed  value  of  0  ,  the  posterior  distribution  of  8  is 

normal  with  mean  (and  modal)  value  [x1 (I+xR)  ^x]  ^X'(I+tR)  Y.  To 

find  the  joint  modes,  then,  we  need  only  substitute  the  modal  value 

2 

of  8  into  (7.5)  and  maximize  with  respect  to  0  .  It  is  easy  to 
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verify  that,  with  this  substitution ,  the  quadratic  form  in  the 


2 

numerator  of  (7.5)  is  equal  to  e'(I+TR)e.  if,  a  priori,  o  is 

assigned  an  inverse  chi-square  distribution,  then  the  resulting 
2 

function  of  a  will  retain  the  form  of  an  inverse  chi-square 

density.  A  special  limiting  case  of  the  inverse  chi-square  is  the 

2  2 

commonly  used  improper  prior  in  which  p(o  )  «  1 /a  ,  for  which  (7.5) 

leads  to  the  modal  estimate : 

a2  =  e* (I+TR)e  /  (n  +  2).  (7.6) 

If  we  were  to  consider  (7.5)  evaluated  at  the  mode  of  0  as  an 

2 

inverse  chi-square  density  for  a  ,  the  corresponding  mean  estimate 
would  be: 

2  «»  e'  (I+TR)e  /  (n  -  2).  (7.7) 

Wecker  and  Ansley  (1983)  analyzed  the  cubic  spline  model  from  a 

frequentist,  rather  than  a  Bayesian,  perspective  and  derived  a 

2 

maximum  likelihood  estimate  of  o  : 

*2 

a  **  e'(X+tR)e  /  n.  (7.8) 

Hie  derivation  of  (7.8)  is  identical  to  that  of  (7.6)  except  that  no 
prior  is  introduced,  so  that  only  the  likelihood  is  maximized. 

7.2  Estimating  T 

We  have  already  seen  that  the  estimated  response  function  and 
the  precision  of  the  estimates  can  vary  considerably  as  functions 
of  t.  Thus  it  is  clearly  important  to  consider  methods  for 
estimating  t.  The  spline  literature,  in  particular,  has  devoted 
considerable  attention  to  this  problem.  Craven  and  Wahba  (1977) 
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proposed  choosing  t  to  minimize  the  generalized  cross  validation 
( GCV)  function: 

V(T)  =  e'e  /  [tr(B>]2,  (7.9) 

where  e  and  B  are  defined  as  in  Section  7.1.  Since  V(t )  is 
not  a  simple  function  of  t,  a  numerical  search  algorithm  is 
necessary  to  find  the  estimate.  The  GCV  method  is  justified  as  an 
approximation  to  choosing  t  to  minimize  mean  squared  error  and  has 
been  found  to  work  well  provided  the  data  are  not  too  sparse. 

Wecker  and  Ansley  (1983)  proposed  the  use  of  maximum  likelihood 
to  estimate  t,  which  amounts  to  choosing  T  to  minimize: 

e'd+TRle  /  fdet[  ( I+TR)"1  ]  }  1/n  .  (7.10) 

As  with  the  GCV  method,  a  numerical  search  is  necessary  to  carry  out 
the  minimization. 

A  strictly  Bayesian  approach  would  be  to  postulate  a  prior 
distribution  for  x  and  then  to  compute  its  posterior  distribution 
given  the  observed  data.  Moreover,  since  we  have  derived  posterior 
distributions  for  the  response  function  that  are  conditional  on 
t,  it  would  then  be  appropriate  to  derive  the  marginal  distribution 
of  the  response  function  by  averaging  over  the  posterior  of  t . 
Unfortunately,  the  posterior  distribution  will  inevitably  be  quite 
complicated,  so  that  the  averaging  with  respect  to  t  that  is 
contemplated  above  will  probably  be  intractable  analytically.  We 
could,  as  an  approximation,  simply  derive  a  point  estimate  of  T 
and  then  proceed  to  estimate  the  response  function  as  though  that 
were  the  true  value,  a  procedure  that  is  reminiscent  of  empirical 
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Bayes  estimation  (see,  for  example,  Morris  1983).  The  point 

2 

estimate  for  t,  like  that  for  a  ,  could  be  derived  as  the 
posterior  mode  and,  assuming  that  the  posterior  is  dominated  by  the 
likelihood,  would  be  similar  to  Wecker  and  Ansley's  maximum 
likelihood  estimate  (7.10). 

It  is  not  at  all  clear  what  would  constitute  a  reasonable  prior 
distribution  for  T.  Ideally,  a  prior  for  r  should  reflect  the 
experimenter's  beliefs  about  how  severe  the  bias  is  likely  to  be 
relative  to  experimental  error.  Young  (1977)  suggested  that  the 
prior  for  t  be  chosen  from  the  family  of  inverse  chi-square 
distributions,  but  it  is  still  necessary  to  specify  a  prior  mean  and 
variance  for  r.  Another  possibility  would  be  to  assign  an 
"uninformative"  prior  to  T,  an  approach  that  is  often  advocated 
for  parameters  about  which  prior  information  is  weak.  But  what 
should  this  uninformative  prior  be?  One  candidate  is  p(T)  «  1/r, 
since  t  is  a  scale  parameter  in  the  model.  (This  would  also  be 
the  uninformative  prior  from  within  the  inverse  chi-square 
family.)  But  the  model  here  is  a  complicated  one  and,  without 
further  research,  it  is  not  clear  what  priors  will  lead  to 
reasonable  answers. 


7.3  Estimating  F(n,'«r) 

The  covariance  function  R(u,v)  may  be  seen  here  as  a 
high-dimensional  nuisance  parameter  in  which  we  have  little  or  no 
intrinsic  interest  but  which  we  require  in  order  to  estimate  the 


37 


response  function.  Although  non-parametric  estimation  of  R{u,v) 


might  be  possible,  the  common  strategy  has  been  to  propose  some 
simple  parametric  form  such  as  those  described  in  Section  3.  The 
problem  of  estimating  R  is  thus  reduced  to  one  of  estimating  the 
associated  parameters,  which  could  be  accomplished  using  the  methods 
described  in  Section  7.2  for  estimating  T.  The  parameters  in  R, 
however,  typically  enter  the  likelihood  in  a  much  more  complicated 
fashion  than  does  t,  so  that  all  the  comments  in  the  preceding 
section  on  the  difficulty  of  estimating  T  and  of  assigning  it  a 
prior  distribution  will  be  even  more  true  of  parameters  in  R. 
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8.  EXAMPLES 


In  this  section  we  present  two  examples  to  illustrate  the 
Bayesian  response  surface  model.  The  first  example  involves 
simulated  data  with  just  one  explanatory  variable;  the  second 
example  considers  actual  data  from  a  chemical  response  surface 
experiment  with  two  explanatory  variables.  All  the  Bayes  estimates 
described  in  this  section  were  calculated  using  the  MATLAB  matrix 
laboratory  package  (see  Moler  1981). 

8. 1  Simulated  Data 

Suppose  the  true  mean  response  to  a  scaled,  centered 
explanatory  variable  x  is  given  by  the  response  functions 

g(x)  *  x  +  3x/(1  +  x  +  3x2),  (8.1) 

which  is  the  sum  of  a  linear  function  and  an  inverse  polynomial  (see 
Nelder  1966  for  a  discussion  of  inverse  polynomials).  The  graph  of 
this  function  appears  in  Figure  1.  Experimental  data  were  simulated 
by  adding  computer  generated  random  errors  to  15  equally  spaced 
design  points  between  x  *  -1.4  and  x  ■  1.4.  The  random  errors 
were  generated  from  a  normal(0,0.2  )  distribution  via  the  NRANDOM 
command  in  Minitab  (see  Ryan,  Joiner,  and  Ryan  1981).  The  simulated 
data  appear  as  asterisks  in  Figure  1. 

To  model  the  data,  we  used  a  straight  line  as  a  graduating 
function  (assigning  improper  priors  to  its  coefficients)  and  defined 
the  bias  term  by  considering  an  expansion  of  the  response  function 
in  normalized  Hermite  polynomials  (see  Steinberg  1984  for 
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figure  1«  Bayes  estimate  of  the  response  function  g(x)  =  x  +  3x/ (1 
from  simulated  data. 


details).  The  resulting  covariance  function  for  the  bias  (2.3e)  is: 


exp{ -( u-v) ^w^/ ( 1-w^ )  +  2uvw/(1-w)} 


Covfh  ,T)  }  =  T(Ji 

1  U  VJ 


/I  2J/2 
(  1  -V  ) 


(8.2) 


where  w  is  a  smoothing  parameter  that  indicates  the  rate  at  which 

higher  degree  polynomials  in  the  expansion  are  downweighted.  Rather 

than  estimate  w,  we  chose  to  set  w  -  0.4;  similar  results  were 

obtained  with  other  values  of  w.  He  used  generalized  cross 

validation  (7.9)  to  estimate  T;  the  criterion  function  leveled  off 

at  about  r  =  5  and  we  used  this  as  our  estimate  of  t .  (The 

maximum  likelihood  criterion  (7.10)  indicated  that  t  should  be 

much  closer  to  0  and,  in  both  this  and  the  following  example, 

appeared  to  severely  underestimate  t.)  Both  (7.1)  and  (7.4)  were 

2 

considered  as  estimators  of  a  and,  for  the  estimated  value  of 
x,  they  were  in  near  agreement,  giving  estimates  of  .064  and 
.060,  respectively  (for  other  values  of  t,  (7.1)  and  (7.4) 
differed  by  as  much  as  30%). 

The  Bayes  estimate  of  g(x)  is  graphed  in  Figure  1.  It  does 
an  excellent  job  of  reproducing  g  for  low  and  high  values  of  x; 
although  it  is  less  successful  in  the  middle  region,  this  seems  in 
large  part  to  be  the  consequence  of  the  error  sequence,  for  which 
most  of  the  errors  at  low  x's  were  positive  but  most  of  the  errors 
at  high  x's  were  negative.  In  particular,  the  large  positive 
error  term  at  x  =  -0.4  and  the  large  negative  error  at  x  =  0.4 
appear  to  strongly  influence  the  fit  in  the  middle  of  Figure  1.  The 
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Bayes  estimate  certainly  matches  both  the  data  and  the  true  response 
function  far  more  closely  than  would  the  OLS  straight  line  fit  here, 


r 


i 

and  we  think  it  is  a  useful  alternative. 

Figure  2  contrasts  the  variance  of  the  estimated  response 

function  under  the  Bayesian  model  and  under  OLS.  Since  the 

variances  are  symmetric  about  0,  Figure  2  includes  only  positive 

o 

x.  Three  features  are  noteworthy.  First,  for  fixed  a  ,  the 
Bayesian  model  with  bias  always  suggests  larger  variances  than  does 
OLS.  Over  most  of  the  experimental  range,  the  variances  are  two  to 
three  times  larger  with  the  Bayesian  model.  Second,  for  small  x, 
the  Bayesian  variances  actually  increase  more  slowly  than  the  OLS 
variances.  This  phenomenon  can  be  explained  by  remembering  that, 
with  the  Bayesian  model,  most  of  the  information  used  to  estimate 
g(x)  comes  from  nearby  observations.  Thus  g(1/2)  can  be  estimated 
with  almost  as  much  precision  as  g(0)  because  the  distribution  of 
design  points  close  to  1/2  is  about  the  same  as  that  close  to 
0.  Third,  for  x's  near  the  edge  of  and  outside  the  range  of  the 
data,  variances  increase  dramatically  with  the  Bayesian  model. 
Reasonably  precise  estimates  are  possible  only  over  the  range  of  the 
data,  where  we  have  hard  information  on  the  nature  of  the  response 
function?  outside  that  range,  precision  is  markedly  worse.  The 
Bayesian  model,  by  explicitly  stating  that  a  tentative  model  will  be 
subject  to  bias,  leads  to  the  realistic,  albeit  pessimistic. 


conclusion  that  the  data  provide  little  basis  for  extrapolation 


Figure  2i  Variance  of  the  estimated  responses  using  the  Bayesian  model  and 
using  OLS  regression.  (The  variances  are  scaled,  for  convenience,  by  o 


8.2  Chemical  Experiment 


Myers  (1976)  discusses  a  response  surface  experiment  to  model 
yield  of  the  chemical  mercaptobenzothiazole  (MBT)  as  a  function  of 
reaction  time  and  temperature.  The  experiment  used  a  rotatable 
central  composite  design  which  is  commonly  used  for  fitting  second 
order  polynomials.  The  observed  data  are  listed  in  Table  1  and 
shown  in  Figure  3.  The  number  at  the  center  of  the  design  in  Figure 
3  is  the  average  of  the  three  runs  made  there;  also,  there  was  an 
additional  run  made  at  (-1,0)  which  we  have  deleted  from  our 
analysis  in  order  to  preserve  the  central  composite  structure  (our 
results  would  not  change  much  if  the  run  were  included). 

Following  Myers,  we  fit  a  second  order  polynomial  in  time  and 
temperature  to  the  MBT  data  using  OLS.  Figure  4  shows  a  contour 
plot  of  estimated  yields  from  the  OLS  fit.  The  duplicate  runs  at 
the  origin  make  possible  a  standard  test  for  lack  of  fit  of  the 
second  order  polynomial  model.  The  mean  square  error  for  lack  of 
fit  is  39.61  with  3  d.f.  compared  with  a  pure  error  mean  square 
of  0.763  with  2  d.f.  from  the  center  replicates,  indicating  highly 
significant  lack  of  fit.  Inspection  of  the  data  (as  well  as  any 
standard  diagnostics)  reveals  that  the  two  low  yields  in  the 
northeast  corner  of  Figure  3  are  the  major  source  of  discrepancy. 

One  possible  explanation  is  that  these  two  observations  had  unusual 
errors,  but  their  proximity  suggests,  to  the  contrary,  that  they 
accurately  reflect  a  severe  degradation  of  yield  when  temperature 


and  reaction  time  are  too  high.  The  OLS  fit  is  unable  to  account 


Table  Is  Data  from  the  MBT  experiment.  The  explanatory 
variables  have  been  standardized  so  that 

Reaction  time  -  12  minutes 

X1 - 

8  minutes 


Temperature  -  250  degrees  C 


2  *  - 

30  degrees  C 

X2 

Y 

0.71 

-0.71 

81.3 

0.71 

-0.71 

85.3 

•0.71 

0.71 

83.1 

0.71 

0.71 

72.7 

■1.00 

0.00 

83.8 

1.00 

0.00 

81.7 

0.00 

-1.00 

84.7 

0.00 

1.00 

57.9 

0.00 

0.00 

82.4 

0.00 

0.00 

82.9 

0.00 

0.00 

81.2 

Figure  3s  Plot  of  the  MBT  data. 


Reaction  Time 


Va-T 


estimates  of  MBT  yield 


for  this  sharp  degradation  and,  as  a  result,  all  the  OLS  estimates 
are  biased. 

Now  consider  a  Bayesian  model  for  the  MBT  data  that  accounts 
for  the  possible  presence  of  bias.  We  used  a  second  order 
polynomial  in  time  and  temperature  as  the  graduating  function  and 
defined  the  bias  covariance  to  correspond  to  a  two-dimensional 
expansion  using  Hermite  polynomials  (see  Steinberg  1984  for 
details).  The  resulting  covariance  function  is: 


As  in  the  first  example,  w  is  a  smoothing  parameter  that  indicates 
how  rapidly  higher-degree  terms  should  be  discounted.  Similar 
results  were  obtained  for  a  range  of  values  of  w  and  we  elected  to 
set  w  =  0.4  rather  than  attempting  to  estimate  it  from  the  data. 
For  this  choice  of  w,  the  generalized  cross  validation  (7.9) 
estimate  of  t  was  T  =  80.  To  estimate  a  ,  we  computed  both 
(7.1)  and  a  modified  version  of  (7.4),  with  the  pure  error  sum  of 
squares  subtracted  from  the  numerator  and  the  associated  degrees  of 
freedom  subtracted  from  the  denominator.  The  resulting  estimates 
were  0.765  and  0.756,  respectively,  in  close  agreement  with  the 
pure  error  estimate  of  0.763. 

A  contour  plot  of  the  Bayes  estimates  of  MBT  yield  is  given  in 
Figure  5.  The  Bayes  estimates  differ  from  the  OLS  estimates  (in 
Figure  4)  most  noticeably  across  the  top  of  the  plots.  The  Bayes 
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estimates  are  high  in  the  northwest  corner  of  the  plot  and  decrease 
rapidly  as  reaction  time  is  increased,  as  indicated  by  the  closeness 
of  the  contour  lines  in  this  region.  The  OLS  estimates,  on  the 
other  hand,  are  much  higher  in  the  immediate  vicinity  of  the  low 
observations  and  change  gradually  over  the  top  half  of  the  plot. 

The  Bayes  estimates  in  the  northeast  corner  of  the  plot  suggest  that 
yield  will  increase  if  reaction  time  is  increased  still  further,  a 
conclusion  that  seems  implausible.  The  poor  performance  of  the 
Bayes  estimates  in  this  region  is  really  not  surprising:  it  simply 
reflects  the  inability  of  any  model  to  give  accurate  estimates 
throughout  the  range  of  time  and  temperature  when  there  are  so  few 
degrees  of  freedom  to  estimate  the  bias  component.  Also,  it  is 
worth  remembering  that  this  region  is  situated  outside  the  circular 
design  region,  so  the  Bayes  estimates  there  have  very  large 
variances.  Within  the  design  region,  we  think  the  Bayes  estimates 
are. superior  to  the  OLS  estimates. 


9.  DISCUSSION 

In  this  paper,  we  have  analyzed  a  Bayesian  model  for  estimating 
response  surfaces.  The  key  feature  of  the  Bayesian  model  is  the 
inclusion  of  a  term  that  explicitly  represents  the  bias  that  arises 
when  a  simple  graduating  function  is  used  to  provide  a  local 
approximation  to  a  complex  response  function.  The  resulting 
estimate  of  the  response  function  includes  an  estimate  of  the 
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graduating  function  and  a  "correction  for  bias"  term  that  lends  it 
the  flexibility  to  accurately  represent  data  when  the  graduating 
function  alone  is  inadequate.  We  have  derived  measures  of  precision 
for  the  Bayes  response  surface  estimates  and  have  shown  that  OLS  may 
lead  to  an  overly  optimistic  assessment  of  precision  if  bias  is 
present . 

Hie  bias  term  in  the  model  is  described  in  terms  of  a  prior 
probability  distribution  and  we  have  emphasized  how  the  nature  of 
that  distribution,  in  particular  its  covariance  function  (2.3e), 
determine  the  nature  of  the  estimate  of  the  response  surface.  We 
have  reviewed  suggestions  in  the  literature  for  defining  realistic 
covariance  functions  and  methods  that  have  been  proposed  for 
estimating  covariance  parameters.  We  believe  that  there  ia  much 
room  for  useful  research  on  this  topic. 

In  building  empirical  models,  we  usually  seek  some  compromise 
between  the  goals  of  good  fit  and  model  simplicity.  Occasionally, 
we  find  ourselves  in  the  ideal  situation  where  good  fit  can  be 
achieved  with  a  simple  model,  but  such  models  often  prove  too  rigid 
to  provide  a  good  fit.  We  think  that  the  Bayesian  model  discussed 
here  offers  a  useful  approach  for  introducing  additional  flexibility 
in  empirical  modeling. 
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APPENDIX 


This  section  contains  proofs  for  some  of  the  theorems  stated  in 
the  text. 

Theorem  4.1;  Suppose  X  has  full  column  rank  and  an  improper 
prior  is  assigned  to  the  regression  coefficients  in  (2.2)-(2.3). 
Then: 

g(x)  =  lim  e| g{ x) /T=yl  (4.1) 

+  0 

-  f'(x)[x'M"1x]"1X'Il"1y  + 

T«J2r'  (x^lf1-  M_1xfx'll~1x]'1X*ll'1}y, 

where  H  *  (X  +  tK),  and 

lim  e{  8/Y=y }  =  [x,M~1x]“1X,M"1y 

V~  ♦  0 

Proof :  From  Theorem  3.1, 

E{g(x)/T=y}  =  f'(x)80  +  [toV  (x)  +  f '  ( x)  VX*  ] 

2  2  -1 
x  (o  I  +  to  R  +  XVX' )  (y  -  XP0) 

=  f'(x)80  +  f' (x)VX'(o2l  +  to2r  +  XVX' )_1(y  -  x8 Q ) 

2  2  2  “1 
+  TO  r'<x)(o  I  +  to  R  +  XVX')  (y  -  XpQ) 

and  applying  Lemma  2  to  the  first  line  and  Lemma  1  to  the  second 

line  of  the  last  expression, 

-  f'(x)8Q  +  f'  (xjfo^X'M^X  +  V~1]"1o'2X'lT1(y-XP0) 

+  To2rMx)|o-V1-  o-Vxfo'W’x  +  V-1]"1 
x  o“2X'll‘1}(y-XB0) 

and  provided  X  has  full  column  rank,  as  V*1  +  0,  this  converges 
to: 

f'(x)80  +  f (X)[x'«_1x]‘1x’«“1(y-xp0) 

+  TrMxllM*1-  m"1x[x,«"1x]“1X'Ii'1}  (T-»P0) 
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=  f'  (x)[x*M-1x]"1X'lT1y 

+  Tr 1  ( x)  I**""1  -  H“1X[X'M"1X]“1X'M-Ijy. 

The  proof  of  (4.2)  is  contained  in  the  proof  of  (4.1). 

Theorem  5.1:  Given  the  model  (2.2)-(2.3),  suppose  that  the  matrix 

B  derived  from  R  by  eliminating  identical  rows  and  columns  is 

non-singular.  Then: 

% 

Y^  =  average  of  all  observations  at  x^ .  (5.2) 

Proof :  Denote  the  distinct  design  points  by  x^,.... and  denote 
by  Y  the  mxl  vector  of  average  responses  at  the  design  points. 

The  sampling  density  of  Y  can  be  factored  into  the  density  of  Y 

times  the  density  of  Y  given  Y.  The  latter  is  clearly 
independent  of  g(x),  so  we  can  compute  the  posterior  distribution 
of  g(x)  by  conditioning  only  on  Y.  The  sampling  distribution 
of  Y,  conditional  on  0  and  n,  is: 

Y  ~  N(X0  +  n,  02 D), 

where  X  is  the  mxp  matrix  whose  ith  row  is  f'fx^),  *1  is 

the  mxl  vector  of  bias  terms  at  the  distinct  design  points,  and 
D  is  a  diagonal  matrix  whose  1th  entry  is  1/n^,  where  n^  is 

the  number  of  observations  at  x^.  The  prior  covariance  matrix  of 

i)  is  to2R,  where  R  is  precisely  the  matrix  obtained  by 
eliminating  duplicate  rows  and  columns  of  R.  if  we  recompute 
(3.2),  (4.1),  and  (4.9)  conditioning  on  Y,  we  obtain  (5.2)  directly 
from  (4.9)  under  the  assumption  that  R  is  non-singular. 
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Theorem  5.3:  Y  solves  the  minimization  problem:  find  u  to 


minimize 

(u-y)'(n-y)  +  <o-X0o  )  'o2(to2R  +  XVX'  )-1  (o-XpQ) .  (5.5) 

If  R  is  non-singular,  then  (5.5)  converges  to: 

(u-y)'(u-y)  +  t"1^  [r"1-  r“1X(R,r“1X)"1X,R~1]u  (5.6) 

as  V-^  •+■  0.  The  second  term  in  (5.6)  is  0  if  and  only  if 
n  e  col(X) . 

Proof :  Denote  by  0  the  vector  of  expected  values  at  the  design 
points,  as  in  (2.4).  Then,  by  definition,  Y  is  the  posterior 
expectation  of  0.  From  (2.4),  the  prior  distribution  of  0  is: 

0  ~  N(X0q,TO2R  +  XVX* ), 

and  the  sampling  distribution  of  Y  given  0  is: 

Y/0  ~  N(0,O2I). 

Applying  Bayes'  Theorem,  the  posterior  density  of  0  is 
proportional  to: 

exp{-f  a-2(y-0) ' (y-0 )  +  (0-J®Q ) * (ra2R  +  XVX' )_1 (0-XBo)]/2} 

=  exp{-Q(0)/2} , 

which  corresponds  to  the  density  for  a  normal  distribution.  Thus 

A 

the  posterior  mean  of  0,  Y,  can  be  found  by  minimizing  the 

quadratic  form  Q(@),  and  that  is  clearly  equivalent  to  minimizing 

(5.5).  We  can  easily  derive  (5.6)  from  (5.5)  by  applying  Lemma  1  of 

2 

Section  4  with  A  »  TO  *.  If  u  e  col(X),  then  a  =  Xy  for  some 
vector  Y  and  it  is  easy  to  verify  that  the  second  term  in  (5.6)  is 
0.  To  prove  the  converse,  note  that  the  second  term  has  the  form 
u'bi,  where  the  matrix  L  is  obtained  as  the  limit  of  the  matrix 


J 
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in  the  second  term  of  (5.5).  This  latter  matrix  is  positive 
definite,  so  L  must  be  positive  semi-definite.  Thus  u'lu  =  0 
implies  Lu  =  0  and  0  =  RLu  =  Pu,  where 

P  *  RL  =  I  -  XU1*'1!)"1!1*"1  is  an  idempotent  matrix  that  projects 
into  the  orthogonal  complement  of  col(X).  Thus  Pu  =  0  implies 
that  u  e  col ( X) ,  as  claimed. 

Theorem  6. 2 ;  Given  (2.2)-(2.3), 

lim  Var{g(x)/Y=y}  =  02{tR(x,x)  + 

o 

-  2-rr'  (x)M"1X(X'll“1X)~1f(x) 

-  T2r,(x)[n"1-  M_1X(X,B_1X)'1X,ll_1]r(x)}  . 

(6.4) 

lim  Var{ B/T=y}  =  a2 (X’*"1!)"1 .  (6.5) 

v  +  0 

Proof :  Applying  Lemma  1  of  Section  4  in  reverse  to  (6.2)  yields  the 
identity  s 

V  -  VX'  (&  I  +  TO  R  +  XVX' )”  XV  =  [o  x**-  X  +  v”  ]“ 
and  (6.5)  follows  immediately.  To  prove  (6.4),  note  that  we  can 
rewrite  (6.1)  as: 

Var{g(x)/T=y}  »  to  R(x,x)  +  f' (x)[v  -  VX* (0  M  +  XVX’)”  Xv]f(x) 

-  2T02r' (x)[o2M  +  XVX' ]_1XVf(x) 

-  T2o4r* (x)[o2M  +  XVX']-1r(x). 

Now  apply  the  above  identity  to  the  first  line,  Lemma  2  of  Section  4 
to  the  second  line,  and  Lemma  1  of  Section  4  to  the  third  line. 
Taking  limits  as  V-1  +  0  yields  (6.4). 
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Theorem  6.3:  Given  the  model  (2.2)-(2.3): 


(i)  The  posterior  variance  of  g(x)  is  a  monotone  increasing 
function  of  r . 

(ii)  The  posterior  variance  of  g(x)  obtains  a  minimum  value  of 

o2f ' (x) (X'X  +  a2V-1)-1f(x) 

when  t  ■  0.  If  the  regression  coefficients  are  assigned  an 
improper  prior,  the  minimum  value  is 

o 2f ' (x)(X'X)“1f(x). 

(iii)  If  x  is  a  design  point,  then: 

A  A 

Var{g(x)/Y=y }  <  a  . 

(iv)  If  both  R  and  are  non-singular,  then  the  posterior 

variance  of  g(x)  diverges  to  infinity  as  t  ♦  ». 

(v)  The  posterior  variance  of  0  is  a  monotone  increasing  function 
of  T. 

(vi)  The  minimum  posterior  variance  of  0  is  attained  when  T  »  0 
and  is 

a2(X'X  +  oW\ 

If  the  regression  coefficients  are  assigned  am  improper  prior,  the 
minimum  value  is 

a2(X'X)“1. 

Proof •  Results  (ii)  and  (vi)  are  trivial.  To  prove  (i),  denote 

by  W(t)  the  prior  covariance  matrix  of  (T’ ,g(x)).  This  matrix  has 
2  2 

the  form:  W(t)  «  o  I*  +  to  ■*  +  X*VX* ' .  The  posterior  variance 
of  g(x)  is  simply  the  inverse  of  the  lower  right-hand  corner 
element  of  W(T)”1:  that  is. 
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Varf  g(x)/Y=y}  =  f  T  >  ~\+1  ] 

where  vn+^  is  a  unit  vector  whose  (n+ 1  )st  element  is  1  and  whose 
remaining  elements  are  0.  Now,  suppose  T2  >  T1  >  Then 

IHtj)  ~  W(t^)  will  be  positive  semi-definite,  because  R*  is 
positive  semi-definite,  and  W(t^)  ^  -  WfTj)  1  will  also  be 
positive  semi-definite.  Therefore: 

0  ‘  Vi^'V''  -«T2,'1l  Vi 

=  [ Var{ g{ x) /T=y,T 1 } ]  1  -  [ Var{g(x)/T=y,t2} ]  1 
which  implies  that 

VarjgUJ/^y,-^}  <  Var{  g(x)/Y=y,T2} , 
proving  the  claimed  monotonicity  property. 

(iii)  It  will  suffice  to  consider  the  posterior  variance  at  x^. 

2 

Conditional  on  g(Xj),  Y^  has  a  normal  (g(x.,),a  )  distribution,  so 

the  conditional  distribution  of  g(x.j)  given  Y1f  which  we  know  to 

2 

be  normal,  has  a  variance  of  at  most  a  .  Thus: 

Var{g(x1 )/T=y}  <  Var{g(x^ )/Y1=y1 }  <  a2. 

(iv)  As  in  Theorem  5.1,  we  can  compute  the  posterior  variance  of 
g(x)  by  conditioning  only  on  the  replicate  averages,  which  yields  a 
formula  analogous  to  (6.1)  but  with  R  in  place  of  R,  an  mxl 
vector  ?(x)  in  place  of  r(x),  and  a  diaqonal  matrix  D  in  place 
of  I,  where  ^  =  1/n^,  and  n^  is  the  number  of  observations 
at  the  ith  design  point.  Rearranging  terms,  we  obtain: 

Var{g(x)/Y=y}  =  f ' (x) [ Var{ 0/T=y} ] f (x) 


,-1 


-  2To‘ir,  (x)[aZD  +  to2r  +  xvx*  ]  'xvf(x) 


i-1 


+  tozR(x,x)  -  t2a^r' (x)[a^D  +  to^r  +  XVX']_1r(x). 


1- 
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The  first  line  of  the  last  expression  is  clearly  non-negative, 
regardless  of  the  value  of  t.  The  second  term  can  be  rewritten 
-  2o2r‘ (x)[t_102d  +  a2R  +  x'1XVX' j^xvf (x) 
and  converges  to  the  finite  limit  -2? (x)it~1XVf (x)  as  r  ♦  «■>, 
when  R  is  non-singular.  The  final  line  can  be  rewritten 
to2{r(x,x)  -  o2r'(x)[T  ^a2D  +  a2R  +  r  ^XVX']  V(x)}. 

If  R  is  non-singular,  the  term  in  curly  brackets  converges  to 

R(x,x)  -  r*  (x)R  Vfx)  as  t  ♦  «,  and  if  is  non-singular, 

this  expression  is  positive.  Thus,  if  both  matrices  are 
non-singular,  the  last  line  tends  to  infinity  as  t  *,  so  that 
Var{g(x)/Y«y}  diverges  to  infinity  as  r  ■*■  <*>. 

(v)  The  proof  of  (v)  follows  the  same  lines  as  the  proof  of  (i). 
Theorem  6.4:  The  posterior  variance  matrix  of  B  is: 

Var{e/Y«y}  >=  0  I  -  a  (o  I  to  R  +  XVX'  )  .  (6.6) 

(6.6)  is  monotone  increasing  in  T  and  achieves  a  minimum  of 

o2X(X'X  +  o2^1)-1!'  when  t  -  0.  If  R  is  non-singular,  (6.6) 

2 

converges  to  o  I  as  t  ♦ 

lim  Var{e/Y-y}  -  o2{l  -  M_1  +  *t-1x[  X'lT’x]  "VlT1} . 

v~  ♦  0 

(6.7) 

(6.7)  is  also  monotone  increasing  in  t  and  achieves  a  minimum  of 
2  - 1 

a  X(X’X)  X'  when  t  ■  0.  If  R  is  non-singular,  (6.7)  converges 
to  02I  as  t  ♦  ®. 

Proof:  Prom  (2.4),  the  joint  distribution  of  (B',Y')  is 
multivariate  normal  with  covariance  matrix: 
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2 

TO  R  +  XVX* 

2  ~"~ 

TO  R  +  XVX' 

From  standard  properties  of  multivariate  normal  distributions, 

Var{0/Y=y)  =  to2r  +  XVX'  -  (to2r  +  XVX' ) 

x  (o2l  +  to2r  +  XVX' )-1(to2r  +  XVX'). 

2  2 

Mow  add  and  subtract  o  I  to  each  of  the  (to  R  +  XVX1)  terms. 
After  some  obvious  cancellations,  we  obtain  (6.6).  Let  V(0,t) 
denote  (6.6)  as  a  function  of  T.  If  x ^ 

V(6,T2)  -  Vffl,^)  = 

o4[(o2i  +  ti02r  +  XVX')-1  -  (o2i  +  t2o2r  +  XVX')"1]. 

Here  the  matrices  in  the  square  brackets  not  inverted,  their 
difference  would  clearly  be  negative  semi-definite >  with  the 
inverses,  then,  the  difference  is  positive  semi-definite,  as 
claimed.  The  minimum  is  then  achieved  when  t  =  0  and  is 
V( 0, 0 )  =  02I  -  04(02I  +  XVX')-1 

=  02X(X'X  +  02V-1)-1X' 

upon  application  of  Lemma  1  of  Section  4.  The  limit  as  T  +  °° 
follows  immediately  from  (6.6).  We  obtain  (6.7)  from  (6.6)  by 
applying  Lemma  1  of  Section  4  to  the  second  term  and  taking  limits 
as  V-1  ♦  0.  Ttoe  monotonicity  with  respect  to  t  is  unaltered  by 
the  limiting  process  and  the  minimum  value  for  x  =  0  follows  from 

(6.7).  To  obtain  the  limit  of  (6.7)  as  t  +  we  first  take 

- 1  2 

limits  of  the  expression  derived  above  as  V  +  0,  obtaining  o  I 
2 

-  a  B(x),  where  B(x)  is  the  matrix  that  maps  T  into  the 
residual  vector.  If  R  is  non-singular,  we  know  that  the  estimated 


TO  R  ♦  XVX' 


2  2 

a  i  +  to  R  +  XVX' 
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